Motion Cloth

• Cloth simulation developed by Ubisoft

Agenda

• Cloth Simulation Performance Post-mortem

• What is the solution?

• Journey from C++ to Compute Shaders
Cloth simulation performance post-mortem

• The cloth simulation itself is quite fast

• But it requires a lot of processing before and after

Assassin's Creed Unity:

• Simulation: ~40%

• Pre- & post-simulation: ~60%
Cloth simulation performance post-mortem

• Skinning 

• Interpolation system 

• Mapping 

• Tangent space 

• Critical path 
Skinning


In an ideal world:

• Set a material on the cloth

• Let the simulation do the job
Skinning


In practice: 

• We need to control the cloth 

• The cloth must look impressive even when the 
character's movement is not physically realistic 

• The skinned vertices are heavily used to control the 
cloth 
Skinning


Maximum distance constraints:

• Maximum displacement of each vertex

• Relative to its skinned position

• Controlled by a vertex paint layer



GAME DEVELOPERS CONFERENCE® 2015 


MARCH 2-6, 2015 GDCONF.COM 


Skinning 



• The simulated vertex can 
move inside a sphere centered 
around the skinned vertex 

• The radius of the sphere 
depends on the color at the 
vertex in the vertex paint layer 
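
The maximum distance constraint described above can be sketched in C++ as a clamp of the simulated position to a sphere around the skinned position. This is only an illustration: the names `Vec3`, `ApplyMaxDistance` and `RadiusFromPaint`, and the linear mapping from painted value to radius, are assumptions and not taken from the Motion Cloth code.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Hypothetical mapping from a painted value in [0, 1] to a sphere radius.
inline float RadiusFromPaint(float paint, float maxRadius)
{
    return paint * maxRadius;
}

// Keep the simulated vertex inside a sphere centered on the skinned vertex.
inline Vec3 ApplyMaxDistance(Vec3 simulated, Vec3 skinned, float maxDistance)
{
    assert(maxDistance >= 0.0f);
    const float dx = simulated.x - skinned.x;
    const float dy = simulated.y - skinned.y;
    const float dz = simulated.z - skinned.z;
    const float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (dist <= maxDistance)
        return simulated;                   // already inside the sphere
    const float s = maxDistance / dist;     // dist > 0 on this branch
    return { skinned.x + dx * s, skinned.y + dy * s, skinned.z + dz * s };
}
```

With a radius of 0 the vertex is pinned to its skinned position, which is how a painted layer can keep parts of the cloth fully under animation control.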
Skinning


Skinning is also used by:

• Blend constraints

• Levels of detail
Skinning

We definitely need to compute skinning

• Compute on the GPU then transfer

-> Serious synchronization issues

• Compute on the CPU

-> Most of the time before the simulation
Cloth simulation performance post-mortem


• Skinning

• Interpolation system

[Pipeline diagram so far: Skinning -> Cloth simulation]
Interpolation system

Game frame rate ≠ simulation frame rate
Game frame rate:

• Usually locked to 30 fps 

• But can be lower in a few specific places on consoles 

• Can be lower and fluctuate on PC 

• Also fluctuates a lot during the production of the game 



Interpolation system


Game frame rate ≠ simulation frame rate
Simulation frame rate:

• Must be fixed (limitation of the algorithm)

• 30 fps if no collision or slow pace

-> Flags, walking characters

• 60 fps if fast moving collision objects

-> Running or playable characters
Interpolation system


Cloth simulation called several times per frame


Interpolate:

• The skinned vertices (position and normal)

• Collision objects (position and orientation)

-> Still quite cheap compared to skinning
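
A minimal C++ sketch of the idea, assuming a classic time accumulator: the game frame decides how many fixed simulation steps to run, and each step receives skinned positions interpolated between the previous and the current game frame. The accumulator scheme and the names `StepsForFrame` and `SkinnedAtStep` are illustrative assumptions; the slides do not show the actual implementation.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

inline Vec3 Lerp(Vec3 a, Vec3 b, float t)
{
    return { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t, a.z + (b.z - a.z) * t };
}

// Number of fixed simulation steps to run for one game frame.
// 'accumulator' carries the leftover time from one frame to the next.
inline int StepsForFrame(float frameDt, float simDt, float& accumulator)
{
    assert(simDt > 0.0f && frameDt >= 0.0f);
    accumulator += frameDt;
    int steps = 0;
    while (accumulator >= simDt)
    {
        accumulator -= simDt;
        ++steps;
    }
    return steps;
}

// Skinned position fed to sub-step 'step' (0-based) out of 'steps':
// the last sub-step uses exactly the current frame's skinned position.
inline Vec3 SkinnedAtStep(Vec3 previous, Vec3 current, int step, int steps)
{
    assert(steps > 0 && step >= 0 && step < steps);
    return Lerp(previous, current, float(step + 1) / float(steps));
}
```

The same interpolation would apply to normals and to the position and orientation of collision objects; only positions are shown here to keep the sketch short.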
Cloth simulation performance post-mortem


• Skinning

• Interpolation system

• Mapping

[Pipeline diagram so far: Skinning -> Interpolation -> Cloth simulation]
Mapping


WHAT? 


• Map a high-res visual mesh 

• To a lower-res simulated 
mesh 
Mapping


WHY?


• Simulating a high-res mesh is too costly

• It doesn't give good results

-> Too silky, too light

• Ability to update the visual mesh without breaking the cloth setup
Mapping


COST?


• Compute position and normal of each visual vertex

• Mapping ~ 10x faster than simulation

• But high-res mesh can have 10x more vertices!

-> Up to same cost or even higher in worst cases
Cloth simulation performance post-mortem


• Tangent space

[Pipeline diagram so far: Skinning -> Interpolation -> Cloth simulation -> Mapping]
Tangent space

• Tangent space is required for normal mapping

Most of the time taken after the simulation

• Compute it on the GPU

-> Requires specific shaders

-> Costly
Cloth simulation performance post-mortem


• Critical path

[Pipeline diagram: Skinning -> Interpolation -> Cloth simulation -> Mapping -> Tangent space]
Critical path


WHAT IS CRITICAL PATH?

[Diagram: tasks laid out per thread, starting with "Thread 1"]
Critical path


• Adding a task on the critical path

-> Bigger duration for the game engine loop

• Adding a task outside the critical path

-> Doesn't change the engine loop's duration

-> It's "free"

• Unless task is too big

• Unless perfect balancing
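
To make the notion concrete, here is a small C++ sketch that computes the length of the critical path of a task graph, i.e. the longest chain of dependent tasks. It assumes enough threads to start every task as soon as its dependencies are done, and the `Task` structure and the durations used in the usage note are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Task
{
    float duration;                 // e.g. in milliseconds
    std::vector<int> dependencies;  // indices of tasks that must finish first
};

// Length of the critical path. Tasks must be listed so that every
// dependency appears before the task that depends on it.
inline float CriticalPathLength(const std::vector<Task>& tasks)
{
    std::vector<float> finish(tasks.size(), 0.0f);
    float longest = 0.0f;
    for (std::size_t i = 0; i < tasks.size(); ++i)
    {
        float start = 0.0f;
        for (int d : tasks[i].dependencies)
        {
            assert(d >= 0 && static_cast<std::size_t>(d) < i);
            start = std::max(start, finish[d]);
        }
        finish[i] = start + tasks[i].duration;
        longest = std::max(longest, finish[i]);
    }
    return longest;
}
```

For example, with Animation (4) -> Skinning (2) -> Rendering (5), the loop takes 11. A 3-unit cloth task that only Rendering depends on leaves the result at 11, whereas the same task placed between Skinning and Rendering raises it to 14.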
Critical path

Is cloth simulation on the critical path?

• Scenario 1: cloth doesn't need skinning

• Dependency: Cloth simulation -> Rendering

-> Not on the critical path
Critical path

Is cloth simulation on the critical path?

• Scenario 2: cloth does need skinning

[Dependency diagram; boxes: Procedural Animation, Physics, Gameplay, Animation, Skinning, Cloth simulation, Rendering]
Critical path

Is cloth simulation on the critical path?

• Scenario 2: cloth does need skinning

-> Most of the time on the critical path

Consequence:

"Hey! The game is too slow!"

"Use more aggressive cloth levels of detail, and it's fixed!"
• Cloth Simulation Performance Post-mortem

• What is the solution?

[Bar chart: peak power in Gflops, CPU vs. GPU, for Xbox One and PS4]
• Journey from C++ to Compute Shaders

Journey from C++ to Compute Shaders

• The first attempts 

• A new approach 

• The shader - Easy parts - Complex parts 

• Optimizing the shader 

• The PS4 version 

• What you can & cannot do in compute shader 

• Tips & Tricks 
The first attempts


• Integrate velocity

• Resolve some constraints

• Resolve collisions

• Resolve some more constraints

• Do some other funny stuff

[Diagram: the steps above mapped to a series of separate compute shaders]
The first attempts


• The GPU version is 20x slower than the CPU version

• Too many "Dispatch" calls

-> Bottleneck = CPU
The first attempts


• Merge several cloth items to get better performance

• It's better, but it's not enough

• Problem: all cloth items must have the same properties
[Bar chart: CPU vs. GPU; vertical axis from 0% to 140%]
Journey from C++ to Compute Shaders


A new approach


• A single huge compute shader to simulate the entire cloth

• Synchronization points inside the shader

-> A single "Dispatch" call instead of 50+

• Simulate several cloth items (up to 32) using a single "Dispatch" call

-> The GPU version is now faster than the CPU version
Journey from C++ to Compute Shaders


The shader - Easy parts - Complex parts

The shader

• 43 .hlsl files

• 3,400 lines of code (+ 800 lines for unit tests & benchmarks)

• Compiled shader code size = 75 KB
The shader - Easy parts

• Thread group: threads 0, 1, 2, 3, 4, 5, ... 63

• We do the same operation on 64 vertices at a time

-> There must be no dependency between the threads
The shader - Easy parts


Read some global properties to apply (ex: gravity, wind)

Read position of vertex 0 -> Compute -> Write position of vertex 0

Read position of vertex 1 -> Compute -> Write position of vertex 1

...

Read position of vertex 63 -> Compute -> Write position of vertex 63
The shader - Easy parts


Read some global properties to apply (ex: gravity, wind)

Read position of vertex 64 -> Compute -> Write position of vertex 64

Read position of vertex 65 -> Compute -> Write position of vertex 65

...

Read position of vertex 127 -> Compute -> Write position of vertex 127
The shader - Easy parts


Read position of vertex 0 -> Read property for vertex 0 -> Compute -> Write position of vertex 0

Read position of vertex 1 -> Read property for vertex 1 -> Compute -> Write position of vertex 1

...

Read position of vertex 63 -> Read property for vertex 63 -> Compute -> Write position of vertex 63
The shader - Easy parts


Read property for vertex 0 | Read property for vertex 1 | ... | Read property for vertex 63

• Ensure contiguous reads to get good performance

-> Coalescing = 1 read instead of 16

i.e. use Structure of Arrays (SoA) instead of Array of Structures (AoS)
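
The difference between the two layouts can be sketched in C++. The model below is a deliberate simplification: it assumes 64-byte memory transactions and a 64-byte vertex structure, chosen only so that the numbers line up with the "1 read instead of 16" on the slide. `VertexAoS`, `VerticesSoA` and `TransactionsNeeded` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures: one property of consecutive vertices is 64 bytes apart.
struct VertexAoS
{
    float position[4];
    float velocity[4];
    float normal[4];
    float properties[4];  // e.g. mass, max distance (hypothetical fields)
};
static_assert(sizeof(VertexAoS) == 64, "layout assumed by this example");

// Structure of Arrays: each property is stored contiguously.
struct VerticesSoA
{
    std::vector<float> positionX, positionY, positionZ;
    std::vector<float> mass;
};

// Number of distinct 64-byte blocks touched when 'threads' threads each
// read one float, thread t reading at byte offset t * strideBytes.
inline std::size_t TransactionsNeeded(std::size_t threads, std::size_t strideBytes)
{
    assert(strideBytes % sizeof(float) == 0 && strideBytes > 0);
    std::size_t count = 0;
    std::size_t previousBlock = static_cast<std::size_t>(-1);
    for (std::size_t t = 0; t < threads; ++t)
    {
        const std::size_t block = (t * strideBytes) / 64;
        if (block != previousBlock)
        {
            ++count;
            previousBlock = block;
        }
    }
    return count;
}
```

Under these assumptions, 16 threads reading `mass` from the SoA layout touch a single block, while the same read through `VertexAoS` touches 16.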




The shader - Complex parts

• A binary constraint modifies the position of 2 vertices

[Diagram: Vertex A and Vertex B linked by a constraint]
The shader - Complex parts


• Binary constraints:

• 4 constraints updating the position of the same vertex

-> 4 threads reading and writing at the same location

-> Undefined behavior
The shader - Complex parts


• Binary constraints:

Group 1

GroupMemoryBarrierWithGroupSync()

Group 2

GroupMemoryBarrierWithGroupSync()

Group 3

GroupMemoryBarrierWithGroupSync()

Group 4
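
One way to build such groups offline is a greedy pass that places each constraint in the first group where neither of its vertices is already used, so that no two threads of a group write to the same vertex. The C++ sketch below illustrates that idea only; the slides do not say how Motion Cloth builds its groups, and `Constraint` and `GroupConstraints` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Constraint { int a, b; };  // indices of the two vertices it moves

// Returns, for each constraint, the index of the group it was assigned to.
// Within one group, every vertex is touched by at most one constraint.
inline std::vector<int> GroupConstraints(const std::vector<Constraint>& constraints,
                                         int vertexCount)
{
    assert(vertexCount > 0);
    std::vector<std::vector<bool>> used;  // used[g][v]: group g already moves vertex v
    std::vector<int> group(constraints.size(), -1);
    for (std::size_t i = 0; i < constraints.size(); ++i)
    {
        const Constraint& c = constraints[i];
        assert(c.a >= 0 && c.a < vertexCount && c.b >= 0 && c.b < vertexCount);
        std::size_t g = 0;
        while (g < used.size() && (used[g][c.a] || used[g][c.b]))
            ++g;
        if (g == used.size())
            used.push_back(std::vector<bool>(static_cast<std::size_t>(vertexCount), false));
        used[g][c.a] = true;
        used[g][c.b] = true;
        group[i] = static_cast<int>(g);
    }
    return group;
}
```

Four constraints that share one vertex end up in four different groups, which matches the four groups separated by barriers on the slide.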
The shader - Complex parts


• Collisions: Easy or not?

• Collisions with vertices -> Easy

• Collisions with triangles

-> Each thread will modify the position of 3 vertices

-> You have to create groups and add synchronization
Journey from C++ to Compute Shaders


Optimizing the shader


• General rule:

-> Bottleneck = memory bandwidth

• Data compression:
Optimizing the shader

• Use Local Data Storage (aka Local Shared Memory)

[Diagram: LDS inside each Compute Unit (12 on Xbox One, 18 on PS4); VRAM outside]
Optimizing the shader

• Store vertices in Local Data Storage

Copy vertices from VRAM to LDS

Step 1 - Update vertices

Step 2 - Update vertices

...

Step n - Update vertices

Copy vertices from LDS to VRAM
Optimizing the shader


• Use bigger thread groups:

• With 64 threads, the GPU is waiting for the memory most of the time

[Diagram: threads 0 to 63; timeline with "Compute" blocks]
Optimizing the shader


• Use bigger thread groups:

• With 256 or 512 threads, we hide most of the latency!

[Diagram: threads 0 to 63 and threads 64 to 127; each timeline shows "Load" and "Compute" blocks]

But...
Optimizing the shader


• Number of vertices usually not a multiple of 64

-> Dummy vertices

-> Useless work!
Optimizing the shader

[Diagram: thread groups covering threads 0 to 127, then threads 0 to 255]

-> Bigger thread group = more dummy vertices
Optimizing the shader

To get the best performance:

• Use several shaders with different thread group sizes

• Use the most efficient shader depending on the number of vertices of the cloth
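
The trade-off between latency hiding and dummy vertices can be put into numbers with a short C++ sketch. The padding arithmetic follows directly from the slides; the selection rule in `ChooseGroupSize` (largest group size whose padding stays under a tolerated fraction) is purely hypothetical, since in practice the choice would come from measurements.

```cpp
#include <cassert>

// Threads actually launched: vertex count rounded up to a multiple of the group size.
inline int PaddedVertexCount(int vertexCount, int groupSize)
{
    assert(vertexCount > 0 && groupSize > 0);
    return (vertexCount + groupSize - 1) / groupSize * groupSize;
}

// Dummy vertices processed for nothing.
inline int DummyVertexCount(int vertexCount, int groupSize)
{
    return PaddedVertexCount(vertexCount, groupSize) - vertexCount;
}

// Hypothetical rule: prefer the biggest thread group whose wasted work
// stays below 'maxWaste' times the real vertex count.
inline int ChooseGroupSize(int vertexCount, float maxWaste)
{
    const int sizes[] = { 512, 256, 128, 64 };
    for (int size : sizes)
    {
        if (DummyVertexCount(vertexCount, size) <= maxWaste * vertexCount)
            return size;
    }
    return 64;  // smallest group: least padding
}
```

A cloth with 130 vertices needs 62 dummy vertices with 64-thread groups but 126 with 256-thread groups, so the small shader is the better fit; a cloth with 500 vertices wastes only 12 threads in a 512-thread group.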
Journey from C++ to Compute Shaders


The PS4 version


• Porting from HLSL to PSSL is easy:


#ifdef PSSL
#define numthreads                       NUM_THREADS
#define SV_GroupIndex                    S_GROUP_INDEX
#define SV_GroupID                       S_GROUP_ID
#define StructuredBuffer                 RegularBuffer
#define RWStructuredBuffer               RW_RegularBuffer
#define ByteAddressBuffer                ByteBuffer
#define RWByteAddressBuffer              RW_ByteBuffer
#define GroupMemoryBarrierWithGroupSync  ThreadGroupMemoryBarrierSync
#define groupshared                      thread_group_memory
#endif
The PS4 version

• On DirectX 11:

[Diagrams: a buffer, a compute shader and a copy (CopyResource), with synchronization between them]
The PS4 version

• On PS4:

• No implicit synchronization, no implicit buffer duplication

-> You have to manage everything by yourself

-> Potentially better performance because you know when you have to sync or not

• Also available on Xbox One (use fast semantics contexts)
The PS4 version

• We use labels to know if a buffer is still in use 
by the GPU 

• Still used -> Automatically allocate a new buffer 

• "Used" means used by a compute shader or a copy 

• We also use labels to know when a compute shader 
has finished, to copy the results 
Journey from C++ to Compute Shaders


What you can & cannot do in compute shader

What you can do in compute shader

[Bar chart: peak power in Gflops (axis from 0 to 1800), CPU vs. GPU, for Xbox One and PS4]
What you can do in compute shader

• Using DirectCompute, you can do almost everything in compute shader

• The difficulty is to get good performance

What you can do in compute shader


• Efficient code = you work on 64+ data at a time

• If you have less data:
What you can do in compute shader

• Example: collisions

• On the CPU:

-> Compute a bounding volume (ex: Axis-Aligned Bounding Box)

-> Use it for an early rejection test

-> Use an acceleration structure (ex: AABB Tree) to improve performance
What you can do in compute shader


• Example: collisions

• On the GPU:

-> Compute a bounding volume (ex: Axis-Aligned Bounding Box)

-> Just doing this can be more costly than computing the collision with all vertices!!!
What you can do in compute shader

• Compute 64 sub-AABoxes (threads 0 to 63)

• Reduce down to 32 sub-AABoxes -> We use only 32 threads for that

• Reduce down to 16 sub-AABoxes -> We use only 16 threads for that

• Reduce down to 8 sub-AABoxes -> We use only 8 threads for that

• Reduce down to 4 sub-AABoxes -> We use only 4 threads for that

• Reduce down to 2 sub-AABoxes -> We use only 2 threads for that

• Reduce down to 1 AABox -> We use a single thread for that

-> This is ~ as costly as computing the collision with 7 x 64 = 448 vertices!!
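
The reduction can be emulated on the CPU to see where the 7 steps come from. In the C++ sketch below, each pass halves the number of active "threads"; on the GPU every pass would still occupy the whole 64-thread group and end with a group barrier. `AABox`, `Merge` and `ReduceBoxes` are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct AABox { float min[3]; float max[3]; };

inline AABox Merge(const AABox& a, const AABox& b)
{
    AABox r;
    for (int i = 0; i < 3; ++i)
    {
        r.min[i] = std::min(a.min[i], b.min[i]);
        r.max[i] = std::max(a.max[i], b.max[i]);
    }
    return r;
}

// Tree reduction: on each pass, "thread" t merges box t with box t + active.
// The final box ends up in boxes[0]. Returns the number of passes executed.
inline int ReduceBoxes(std::vector<AABox>& boxes)
{
    assert(!boxes.empty() && (boxes.size() & (boxes.size() - 1)) == 0);  // power of two
    int passes = 0;
    for (std::size_t active = boxes.size() / 2; active >= 1; active /= 2)
    {
        for (std::size_t t = 0; t < active; ++t)  // only 'active' threads do useful work
            boxes[t] = Merge(boxes[t], boxes[t + active]);
        ++passes;                                 // a group barrier would go here on the GPU
    }
    return passes;
}
```

For 64 sub-boxes this takes 6 passes; adding the initial pass that computes the 64 sub-boxes gives the 7 steps counted on the slide, each one costing a full 64-thread group.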
What you can do in compute shader

• Atomic functions are available

-> You can write lock-free thread-safe containers

-> Too costly in practice

• The brute-force approach is almost always the fastest one

What you can do in compute shader

Conclusion:


Port an algorithm to the GPU only if you find a way to handle 64+ data at a time 95+% of the time
Journey from C++ to Compute Shaders


Tips & Tricks

Sharing code between C++ & hlsl


#if defined(_WIN32) || defined(_WIN64) || defined(_DURANGO) || defined(__ORBIS__)
// C++ side only: give the HLSL type names a C++ definition
typedef unsigned int uint;
struct float2 { float x, y; };
struct float3 { float x, y, z; };
struct float4 { float x, y, z, w; };
struct uint2 { uint x, y; };
struct uint3 { uint x, y, z; };
struct uint4 { uint x, y, z, w; };
#endif
What to put in LDS?

Memory consumption in LDS


• LDS = 64 KB per compute unit

• 1 thread group can access 32 KB

-> 2 thread groups can run simultaneously on the same compute unit

• Less memory used in LDS

-> More thread groups can run in parallel

• 256- or 512-thread groups: No visible impact

• 64- or 128-thread groups: Visible impact on performance
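
The relation between LDS usage and parallelism can be written as a one-line C++ function. It only accounts for the LDS limit given on the slide; real hardware has other limits (registers, maximum number of resident groups) that this sketch ignores, and `GroupsPerComputeUnit` is a hypothetical helper.

```cpp
#include <cassert>

// Upper bound on the thread groups resident on one compute unit,
// as limited by LDS usage alone.
inline int GroupsPerComputeUnit(int ldsBytesPerGroup)
{
    const int kLdsPerComputeUnit = 64 * 1024;  // 64 KB per compute unit (from the slide)
    const int kMaxLdsPerGroup    = 32 * 1024;  // 32 KB per thread group (from the slide)
    assert(ldsBytesPerGroup > 0 && ldsBytesPerGroup <= kMaxLdsPerGroup);
    return kLdsPerComputeUnit / ldsBytesPerGroup;
}
```

A group that uses the full 32 KB allows 2 groups per compute unit; halving its LDS footprint to 16 KB raises the bound to 4.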
Optimizing bank access in LDS?

• LDS is divided into several banks (16 or 32)

• 2 threads accessing the same bank -> Conflict

-> Visible impact on performance on older PC hardware

-> Negligible on Xbox One, PS4 and newer PC hardware
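
A bank conflict can be illustrated with a small C++ model, assuming the common layout where consecutive 32-bit words map to consecutive banks. That layout, and the helper `MaxThreadsPerBank`, are assumptions for the sake of the example and are not stated on the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Worst-case number of threads hitting the same bank when thread t
// accesses the 32-bit word at index t * strideInWords.
// A result of 1 means no conflict.
inline int MaxThreadsPerBank(int threads, int strideInWords, int bankCount)
{
    assert(threads > 0 && strideInWords > 0 && bankCount > 0);
    std::vector<int> hits(bankCount, 0);
    for (int t = 0; t < threads; ++t)
        ++hits[(t * strideInWords) % bankCount];
    return *std::max_element(hits.begin(), hits.end());
}
```

With 32 banks, a stride of 1 word is conflict-free, a stride of 2 makes pairs of threads collide, and a stride of 32 sends every thread to the same bank. Padding the stride to 33 words removes the conflicts again.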
Beware the compiler

CopyFromVRAMToLDS();

ReadInputFromLDS();
DoSomeComputations();
WriteOutputToLDS();

ReadInputFromLDS();
DoSomeComputations();
WriteOutputToLDS();

//CopyFromLDSToVRAM();


Beware the compiler

CopyFromVRAMToLDS();

ReadInputFromLDS();
DoSomeComputations();
WriteOutputToLDS();

ReadInputFromLDS();
DoSomeComputations();
WriteOutputToLDS();

CopyFromLDSToVRAM();

-> The last copy takes all the time

-> This doesn't make sense!


Beware the compiler

• With the last copy commented out, data written in LDS are never used

• The shader compiler detects it

-> It removes the entire code
Optimizing compilation time


float3 fanBlades[10];
for (uint i = 0; i < 10; ++i)
{
    Vertex fanVertex = GetVerte…
    fanBlades[i] = fanVertex.m…
}

float3 normalAccumulator = cross(fanBlades[0], fanBlades[1]);
for (uint j = 0; j < 8; ++j)
{
    float3 triangleNormal = cross(fanBlades[j+1], fanBlades[j+2]);
    uint isTriangleFilled = neighborFan.m_FilledFlags & (1 << j);
    if (isTriangleFilled) normalAccumulator += triangleNormal;
}


Shader compilation time:

• Loop: 19"

• Manually unrolled: 6"
Iteration time

• It's really hard to know which code will run the 
fastest 

• The "best" method: 

• Write 10 versions of 
your feature 

• Test them 

• Keep the fastest one 


• Loops ordering 

• Which data to compress? 

• Which data to put in LDS? 
• Unroll loops?

• Change data organization? 


A fast iteration time really helps 
Bonus: final performance

[Chart: final performance on Xbox One and PS4]
Thank you!






